Predicting the English Premier League!

John Wilshire (z3421072), Jono Chan (TODO)

17/08/2017

English Premier League

Goal: predict the winner of a match!

We found a huge soccer database:

That was scraped from:

Total dataset:

Games from the 2008/2009 season up to the 2015/2016 season.

We took a subset of this: the EPL.

##                 [,1]        
## number_of_games "3040"      
## first_game      "2008-08-16"
## last_game       "2016-05-17"
## number_of_teams "34"

Exploratory Data analysis

Distribution of scores

Soccer is very low scoring

How has the team done in previous games this season?

For each team we take the sum of its margins over all games in a season.
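
A minimal sketch of how such a running margin could be computed in base R. It assumes that "margin" means goals scored minus goals conceded, and that only games before the current one are counted (the zeros at stage 1 in the later output suggest this, but it is an assumption). The toy table and team ids are made up; only the column names follow the match data.

```r
# Toy match table (hypothetical values, real column names).
matches <- data.frame(
  season = "2008/2009",
  stage = c(1, 2, 3),
  home_team_api_id = c(1, 2, 1),
  away_team_api_id = c(2, 1, 2),
  home_team_goal = c(2, 0, 1),
  away_team_goal = c(0, 1, 1)
)

# One row per team per match, with the margin from that team's point of view.
long <- rbind(
  data.frame(season = matches$season, stage = matches$stage,
             team = matches$home_team_api_id,
             margin = matches$home_team_goal - matches$away_team_goal),
  data.frame(season = matches$season, stage = matches$stage,
             team = matches$away_team_api_id,
             margin = matches$away_team_goal - matches$home_team_goal)
)
long <- long[order(long$season, long$team, long$stage), ]

# Running total of margins within each season and team, excluding the
# current match so the feature only uses information available beforehand.
long$cumulative_margin <- ave(long$margin, long$season, long$team,
                              FUN = function(m) cumsum(m) - m)
print(long)
```

Subtracting the current margin from the cumulative sum keeps the result of the game being predicted out of its own feature.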

For each team we have their 11 player lineup:

## Observations: 3,040
## Variables: 32
## $ season           <chr> "2008/2009", "2008/2...
## $ stage            <int> 1, 1, 1, 1, 1, 1, 1,...
## $ date             <chr> "2008-08-17 00:00:00...
## $ match_api_id     <int> 489042, 489043, 4890...
## $ home_team_api_id <int> 10260, 9825, 8472, 8...
## $ away_team_api_id <int> 10261, 8659, 8650, 8...
## $ home_team_goal   <int> 1, 1, 0, 2, 4, 2, 2,...
## $ away_team_goal   <int> 1, 0, 1, 1, 2, 3, 1,...
## $ home_player_1    <int> 30726, 23686, 32562,...
## $ home_player_2    <int> 30362, 26111, 38836,...
## $ home_player_3    <int> 30620, 38835, 24446,...
## $ home_player_4    <int> 30865, 30986, 24408,...
## $ home_player_5    <int> 32569, 31291, 36786,...
## $ home_player_6    <int> 24148, 31013, 38802,...
## $ home_player_7    <int> 34944, 30935, 24655,...
## $ home_player_8    <int> 30373, 39297, 17866,...
## $ home_player_9    <int> 24154, 26181, 30352,...
## $ home_player_10   <int> 24157, 30960, 23927,...
## $ home_player_11   <int> 30829, 36410, 24410,...
## $ away_player_1    <int> 24224, 36373, 30660,...
## $ away_player_2    <int> 25518, 36832, 37442,...
## $ away_player_3    <int> 24228, 23115, 30617,...
## $ away_player_4    <int> 30929, 37280, 24134,...
## $ away_player_5    <int> 29581, 24728, 414792...
## $ away_player_6    <int> 38807, 24664, 37139,...
## $ away_player_7    <int> 40565, 31088, 30618,...
## $ away_player_8    <int> 30360, 23257, 40701,...
## $ away_player_9    <int> 33852, 24171, 24800,...
## $ away_player_10   <int> 34574, 25922, 24635,...
## $ away_player_11   <int> 37799, 27267, 30853,...
## $ year             <dbl> 2008, 2008, 2008, 20...
## $ outcome          <chr> "Draw", "Home", "Awa...

We can join this with the player statistics table (FIFA data):

## Observations: 183,978
## Variables: 42
## $ id                  <int> 1, 2, 3, 4, 5, 6,...
## $ player_fifa_api_id  <int> 218353, 218353, 2...
## $ player_api_id       <int> 505942, 505942, 5...
## $ date                <chr> "2016-02-18 00:00...
## $ overall_rating      <int> 67, 67, 62, 61, 6...
## $ potential           <int> 71, 71, 66, 65, 6...
## $ preferred_foot      <chr> "right", "right",...
## $ attacking_work_rate <chr> "medium", "medium...
## $ defensive_work_rate <chr> "medium", "medium...
## $ crossing            <int> 49, 49, 49, 48, 4...
## $ finishing           <int> 44, 44, 44, 43, 4...
## $ heading_accuracy    <int> 71, 71, 71, 70, 7...
## $ short_passing       <int> 61, 61, 61, 60, 6...
## $ volleys             <int> 44, 44, 44, 43, 4...
## $ dribbling           <int> 51, 51, 51, 50, 5...
## $ curve               <int> 45, 45, 45, 44, 4...
## $ free_kick_accuracy  <int> 39, 39, 39, 38, 3...
## $ long_passing        <int> 64, 64, 64, 63, 6...
## $ ball_control        <int> 49, 49, 49, 48, 4...
## $ acceleration        <int> 60, 60, 60, 60, 6...
## $ sprint_speed        <int> 64, 64, 64, 64, 6...
## $ agility             <int> 59, 59, 59, 59, 5...
## $ reactions           <int> 47, 47, 47, 46, 4...
## $ balance             <int> 65, 65, 65, 65, 6...
## $ shot_power          <int> 55, 55, 55, 54, 5...
## $ jumping             <int> 58, 58, 58, 58, 5...
## $ stamina             <int> 54, 54, 54, 54, 5...
## $ strength            <int> 76, 76, 76, 76, 7...
## $ long_shots          <int> 35, 35, 35, 34, 3...
## $ aggression          <int> 71, 71, 63, 62, 6...
## $ interceptions       <int> 70, 70, 41, 40, 4...
## $ positioning         <int> 45, 45, 45, 44, 4...
## $ vision              <int> 54, 54, 54, 53, 5...
## $ penalties           <int> 48, 48, 48, 47, 4...
## $ marking             <int> 65, 65, 65, 62, 6...
## $ standing_tackle     <int> 69, 69, 66, 63, 6...
## $ sliding_tackle      <int> 69, 69, 69, 66, 6...
## $ gk_diving           <int> 6, 6, 6, 5, 5, 14...
## $ gk_handling         <int> 11, 11, 11, 10, 1...
## $ gk_kicking          <int> 10, 10, 10, 9, 9,...
## $ gk_positioning      <int> 8, 8, 8, 7, 7, 9,...
## $ gk_reflexes         <int> 8, 8, 8, 7, 7, 12...

Using each player's closest assessment before the game we are trying to predict, we can aggregate these scores and use them as a measure of how good we think a team is.

## Observations: 3,040
## Variables: 81
## $ stage                        <int> 1, 1, 1, 1, 1, 1, ...
## $ date                         <chr> "2008-08-17 00:00:...
## $ match_api_id                 <int> 489042, 489043, 48...
## $ home_team_api_id             <int> 10260, 9825, 8472,...
## $ away_team_api_id             <int> 10261, 8659, 8650,...
## $ home_team_goal               <int> 1, 1, 0, 2, 4, 2, ...
## $ away_team_goal               <int> 1, 0, 1, 1, 2, 3, ...
## $ year                         <dbl> 2008, 2008, 2008, ...
## $ outcome                      <chr> "Draw", "Home", "A...
## $ cumulative_margin_home       <dbl> 0, 0, 0, 0, 0, 0, ...
## $ cumulative_margin_away       <dbl> 0, 0, 0, 0, 0, 0, ...
## $ overall_rating_mean_home     <dbl> 82.45455, 78.00000...
## $ potential_mean_home          <dbl> 84.81818, 84.54545...
## $ crossing_mean_home           <dbl> 62.09091, 60.54545...
## $ finishing_mean_home          <dbl> 57.27273, 51.45455...
## $ heading_accuracy_mean_home   <dbl> 71.54545, 64.90909...
## $ short_passing_mean_home      <dbl> 72.81818, 65.45455...
## $ volleys_mean_home            <dbl> 59.00000, 59.18182...
## $ dribbling_mean_home          <dbl> 63.09091, 66.72727...
## $ curve_mean_home              <dbl> 57.36364, 60.09091...
## $ free_kick_accuracy_mean_home <dbl> 51.90909, 39.63636...
## $ long_passing_mean_home       <dbl> 72.36364, 60.09091...
## $ ball_control_mean_home       <dbl> 73.90909, 69.36364...
## $ acceleration_mean_home       <dbl> 76.81818, 78.45455...
## $ sprint_speed_mean_home       <dbl> 76.81818, 80.00000...
## $ agility_mean_home            <dbl> 65.63636, 70.18182...
## $ reactions_mean_home          <dbl> 79.90909, 76.81818...
## $ balance_mean_home            <dbl> 74.27273, 75.09091...
## $ shot_power_mean_home         <dbl> 66.18182, 61.18182...
## $ jumping_mean_home            <dbl> 75.45455, 75.81818...
## $ stamina_mean_home            <dbl> 80.81818, 77.81818...
## $ strength_mean_home           <dbl> 76.45455, 73.09091...
## $ long_shots_mean_home         <dbl> 56.81818, 50.54545...
## $ aggression_mean_home         <dbl> 75.90909, 63.81818...
## $ interceptions_mean_home      <dbl> 77.36364, 70.09091...
## $ positioning_mean_home        <dbl> 80.81818, 72.45455...
## $ vision_mean_home             <dbl> 73.00000, 68.36364...
## $ penalties_mean_home          <dbl> 79.54545, 69.63636...
## $ marking_mean_home            <dbl> 59.54545, 53.18182...
## $ standing_tackle_mean_home    <dbl> 62.81818, 56.18182...
## $ sliding_tackle_mean_home     <dbl> 60.63636, 60.63636...
## $ gk_diving_mean_home          <dbl> 16.81818, 13.36364...
## $ gk_handling_mean_home        <dbl> 27.54545, 26.72727...
## $ gk_kicking_mean_home         <dbl> 72.36364, 60.09091...
## $ gk_positioning_mean_home     <dbl> 27.81818, 25.90909...
## $ gk_reflexes_mean_home        <dbl> 27.00000, 27.00000...
## $ overall_rating_mean_away     <dbl> 75.27273, 71.18182...
## $ potential_mean_away          <dbl> 81.27273, 76.54545...
## $ crossing_mean_away           <dbl> 60.63636, 53.63636...
## $ finishing_mean_away          <dbl> 50.72727, 46.72727...
## $ heading_accuracy_mean_away   <dbl> 63.00000, 58.90909...
## $ short_passing_mean_away      <dbl> 67.09091, 61.72727...
## $ volleys_mean_away            <dbl> 62.50000, 48.63636...
## $ dribbling_mean_away          <dbl> 61.81818, 54.00000...
## $ curve_mean_away              <dbl> 53.20000, 45.36364...
## $ free_kick_accuracy_mean_away <dbl> 43.63636, 48.54545...
## $ long_passing_mean_away       <dbl> 65.36364, 57.90909...
## $ ball_control_mean_away       <dbl> 67.90909, 62.54545...
## $ acceleration_mean_away       <dbl> 72.90909, 68.18182...
## $ sprint_speed_mean_away       <dbl> 74.72727, 67.90909...
## $ agility_mean_away            <dbl> 69.00000, 58.45455...
## $ reactions_mean_away          <dbl> 70.54545, 68.09091...
## $ balance_mean_away            <dbl> 73.00000, 63.72727...
## $ shot_power_mean_away         <dbl> 56.45455, 59.54545...
## $ jumping_mean_away            <dbl> 67.40000, 70.36364...
## $ stamina_mean_away            <dbl> 74.27273, 71.54545...
## $ strength_mean_away           <dbl> 68.27273, 66.81818...
## $ long_shots_mean_away         <dbl> 51.63636, 51.63636...
## $ aggression_mean_away         <dbl> 69.63636, 69.00000...
## $ interceptions_mean_away      <dbl> 71.00000, 65.27273...
## $ positioning_mean_away        <dbl> 68.18182, 63.90909...
## $ vision_mean_away             <dbl> 70.10000, 63.90909...
## $ penalties_mean_away          <dbl> 69.72727, 67.63636...
## $ marking_mean_away            <dbl> 49.81818, 53.27273...
## $ standing_tackle_mean_away    <dbl> 56.09091, 57.18182...
## $ sliding_tackle_mean_away     <dbl> 56.10000, 54.54545...
## $ gk_diving_mean_away          <dbl> 17.09091, 13.36364...
## $ gk_handling_mean_away        <dbl> 27.00000, 26.54545...
## $ gk_kicking_mean_away         <dbl> 65.36364, 57.90909...
## $ gk_positioning_mean_away     <dbl> 26.72727, 26.09091...
## $ gk_reflexes_mean_away        <dbl> 27.18182, 26.81818...
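
The "closest assessment before the game" join and the per-team mean described above can be sketched as follows. This is an illustration in base R, not the code used for the table above: `latest_rating` is a hypothetical helper, the ratings are toy values, and a two-player lineup stands in for the real eleven.

```r
# Toy player ratings (hypothetical values, real column names).
ratings <- data.frame(
  player_api_id = c(1, 1, 2, 2),
  date = as.Date(c("2008-02-01", "2008-09-01", "2007-08-30", "2008-08-01")),
  overall_rating = c(70, 74, 80, 82)
)

# Hypothetical helper: the most recent rating strictly before the match date.
latest_rating <- function(player_id, match_date, ratings) {
  r <- ratings[ratings$player_api_id == player_id & ratings$date < match_date, ]
  if (nrow(r) == 0) return(NA_real_)
  r$overall_rating[which.max(as.numeric(r$date))]
}

lineup <- c(1, 2)                      # stands in for home_player_1..11
match_date <- as.Date("2008-08-17")

player_ratings <- vapply(lineup, latest_rating, numeric(1),
                         match_date = match_date, ratings = ratings)

# Team-level feature, analogous to overall_rating_mean_home.
overall_rating_mean <- mean(player_ratings, na.rm = TRUE)
print(overall_rating_mean)
```

Filtering on `date < match_date` is the important part: an assessment made after the game would leak future information into the predictor.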

Correlation plots

Predictors about the home team:

Predictors about the away team:

Modeling

Home team wins GLM (Model 1)

Model on everything

glm(home_win ~ . , family = binomial(), data = epl4 %>% select(-outcome, -matches('goal|outcome'))) -> home_full_glm
summary(home_full_glm)
## 
## Call:
## glm(formula = home_win ~ ., family = binomial(), data = epl4 %>% 
##     select(-outcome, -matches("goal|outcome")))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1272  -0.9972  -0.5865   1.0674   2.4146  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -0.836286   2.579394  -0.324 0.745773    
## cumulative_margin_home        0.014607   0.003996   3.655 0.000257 ***
## cumulative_margin_away       -0.012342   0.003999  -3.086 0.002026 ** 
## overall_rating_mean_home     -0.024619   0.056367  -0.437 0.662278    
## potential_mean_home           0.042015   0.040867   1.028 0.303907    
## crossing_mean_home            0.001653   0.017336   0.095 0.924050    
## finishing_mean_home          -0.008298   0.018770  -0.442 0.658431    
## heading_accuracy_mean_home    0.003069   0.020429   0.150 0.880589    
## short_passing_mean_home       0.007477   0.031274   0.239 0.811043    
## volleys_mean_home             0.031482   0.015981   1.970 0.048850 *  
## dribbling_mean_home           0.003786   0.022912   0.165 0.868762    
## curve_mean_home              -0.008841   0.016549  -0.534 0.593189    
## free_kick_accuracy_mean_home  0.035521   0.014564   2.439 0.014727 *  
## long_passing_mean_home        0.011767   0.025291   0.465 0.641737    
## ball_control_mean_home       -0.021327   0.039116  -0.545 0.585604    
## acceleration_mean_home       -0.009251   0.029511  -0.313 0.753910    
## sprint_speed_mean_home        0.001683   0.028344   0.059 0.952647    
## agility_mean_home            -0.006141   0.020909  -0.294 0.768998    
## reactions_mean_home           0.049184   0.029570   1.663 0.096253 .  
## balance_mean_home             0.003522   0.015782   0.223 0.823418    
## shot_power_mean_home         -0.015660   0.019497  -0.803 0.421881    
## jumping_mean_home             0.030277   0.016578   1.826 0.067793 .  
## stamina_mean_home            -0.005951   0.019266  -0.309 0.757396    
## strength_mean_home            0.017964   0.021563   0.833 0.404801    
## long_shots_mean_home         -0.023499   0.020291  -1.158 0.246821    
## aggression_mean_home         -0.007986   0.015100  -0.529 0.596878    
## interceptions_mean_home       0.004172   0.018856   0.221 0.824893    
## positioning_mean_home        -0.005696   0.019206  -0.297 0.766789    
## vision_mean_home              0.022139   0.019647   1.127 0.259820    
## penalties_mean_home           0.015169   0.014267   1.063 0.287681    
## marking_mean_home            -0.015385   0.023646  -0.651 0.515275    
## standing_tackle_mean_home     0.011226   0.029581   0.380 0.704307    
## sliding_tackle_mean_home     -0.037010   0.023222  -1.594 0.110990    
## gk_diving_mean_home           0.050944   0.039854   1.278 0.201156    
## gk_handling_mean_home         0.076229   0.046216   1.649 0.099067 .  
## gk_kicking_mean_home          0.009853   0.016814   0.586 0.557900    
## gk_positioning_mean_home     -0.034341   0.046920  -0.732 0.464222    
## gk_reflexes_mean_home        -0.084819   0.046099  -1.840 0.065778 .  
## overall_rating_mean_away     -0.029729   0.056201  -0.529 0.596817    
## potential_mean_away          -0.042012   0.039831  -1.055 0.291533    
## crossing_mean_away           -0.009620   0.016818  -0.572 0.567305    
## finishing_mean_away           0.001267   0.018399   0.069 0.945119    
## heading_accuracy_mean_away   -0.014740   0.019712  -0.748 0.454609    
## short_passing_mean_away      -0.028182   0.030992  -0.909 0.363179    
## volleys_mean_away             0.014514   0.015491   0.937 0.348805    
## dribbling_mean_away           0.039078   0.022972   1.701 0.088920 .  
## curve_mean_away               0.013912   0.016340   0.851 0.394541    
## free_kick_accuracy_mean_away -0.011804   0.014059  -0.840 0.401138    
## long_passing_mean_away       -0.004881   0.024790  -0.197 0.843920    
## ball_control_mean_away       -0.045529   0.039363  -1.157 0.247423    
## acceleration_mean_away        0.035373   0.029210   1.211 0.225888    
## sprint_speed_mean_away       -0.059708   0.028065  -2.128 0.033377 *  
## agility_mean_away             0.020644   0.020371   1.013 0.310853    
## reactions_mean_away           0.013291   0.028801   0.461 0.644447    
## balance_mean_away            -0.021616   0.015693  -1.377 0.168387    
## shot_power_mean_away         -0.007514   0.019397  -0.387 0.698481    
## jumping_mean_away             0.019873   0.016817   1.182 0.237324    
## stamina_mean_away            -0.003418   0.019262  -0.177 0.859174    
## strength_mean_away            0.041473   0.021206   1.956 0.050501 .  
## long_shots_mean_away         -0.008512   0.020198  -0.421 0.673461    
## aggression_mean_away          0.006474   0.014975   0.432 0.665536    
## interceptions_mean_away      -0.020670   0.018659  -1.108 0.267959    
## positioning_mean_away         0.003029   0.018353   0.165 0.868900    
## vision_mean_away              0.009772   0.019392   0.504 0.614330    
## penalties_mean_away          -0.010008   0.013918  -0.719 0.472074    
## marking_mean_away             0.008284   0.023418   0.354 0.723517    
## standing_tackle_mean_away     0.018267   0.029391   0.622 0.534264    
## sliding_tackle_mean_away     -0.016159   0.023148  -0.698 0.485139    
## gk_diving_mean_away          -0.001992   0.038098  -0.052 0.958308    
## gk_handling_mean_away        -0.035461   0.045440  -0.780 0.435156    
## gk_kicking_mean_away          0.015038   0.016713   0.900 0.368237    
## gk_positioning_mean_away      0.001995   0.046934   0.043 0.966097    
## gk_reflexes_mean_away        -0.005658   0.045912  -0.123 0.901917    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4192.1  on 3039  degrees of freedom
## Residual deviance: 3695.4  on 2967  degrees of freedom
## AIC: 3841.4
## 
## Number of Fisher Scoring iterations: 4
cat('AIC: ')
## AIC:
AIC(home_full_glm)
## [1] 3841.443
cat('full model, 72 predictors\n')
## full model, 72 predictors
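# predict() returns log-odds by default, so "> 0" means a predicted probability above 0.5.
# Rows of the table are predictions, columns are the observed home_win.
caret::confusionMatrix(table(predict(home_full_glm, epl4) > 0, full$home_win))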
## Confusion Matrix and Statistics
## 
##        
##         FALSE TRUE
##   FALSE  1214  586
##   TRUE    436  804
##                                           
##                Accuracy : 0.6638          
##                  95% CI : (0.6467, 0.6806)
##     No Information Rate : 0.5428          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3169          
##  Mcnemar's Test P-Value : 3.15e-06        
##                                           
##             Sensitivity : 0.7358          
##             Specificity : 0.5784          
##          Pos Pred Value : 0.6744          
##          Neg Pred Value : 0.6484          
##              Prevalence : 0.5428          
##          Detection Rate : 0.3993          
##    Detection Prevalence : 0.5921          
##       Balanced Accuracy : 0.6571          
##                                           
##        'Positive' Class : FALSE           
## 
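
As a cross-check, the headline statistics can be recomputed by hand from the four counts in the table above. The sketch below only re-does the arithmetic. caret treats `FALSE` (no home win) as the positive class here, which is why sensitivity refers to the `FALSE` column.

```r
# Counts copied from the confusion matrix above:
# rows are predictions, columns are observed home_win.
cm <- matrix(c(1214, 436, 586, 804), nrow = 2,
             dimnames = list(predicted = c("FALSE", "TRUE"),
                             actual = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)                   # correct / all games
sensitivity <- cm["FALSE", "FALSE"] / sum(cm[, "FALSE"]) # positive class is FALSE
specificity <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])
# No-information rate: accuracy of always guessing the most common class.
no_info_rate <- max(colSums(cm)) / sum(cm)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, no_info_rate = no_info_rate), 4))
```

The no-information rate is the benchmark to beat: always predicting "not a home win" is already right in a little over half of the games.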

Reduced model

glm(home_win ~ . , family = binomial(), data = full %>% select(-outcome, -matches('goal'))) -> home_glm
AIC(home_glm)
## [1] 3782.579
summary(home_glm)
## 
## Call:
## glm(formula = home_win ~ ., family = binomial(), data = full %>% 
##     select(-outcome, -matches("goal")))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1203  -1.0188  -0.6208   1.0809   2.2515  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -1.623131   1.591838  -1.020    0.308    
## cumulative_margin_home    0.016625   0.003808   4.366 1.27e-05 ***
## cumulative_margin_away   -0.015635   0.003807  -4.107 4.02e-05 ***
## overall_rating_mean_home  0.113161   0.014920   7.585 3.34e-14 ***
## overall_rating_mean_away -0.094396   0.014637  -6.449 1.13e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4192.1  on 3039  degrees of freedom
## Residual deviance: 3772.6  on 3035  degrees of freedom
## AIC: 3782.6
## 
## Number of Fisher Scoring iterations: 4
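# Predictions are on the log-odds scale, so "> 0" means a predicted probability above 0.5.
predict(home_glm, full) -> home_preds
cat('reduced model, 4 predictors ')
## reduced model, 4 predictors
# Note: the arguments are passed as (actual, predicted), so the "Prediction" and
# "Reference" labels in the output below are swapped.
caret::confusionMatrix(full$home_win, home_preds > 0)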
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  1210  440
##      TRUE    611  779
##                                           
##                Accuracy : 0.6543          
##                  95% CI : (0.6371, 0.6712)
##     No Information Rate : 0.599           
##     P-Value [Acc > NIR] : 2.027e-10       
##                                           
##                   Kappa : 0.2966          
##  Mcnemar's Test P-Value : 1.573e-07       
##                                           
##             Sensitivity : 0.6645          
##             Specificity : 0.6390          
##          Pos Pred Value : 0.7333          
##          Neg Pred Value : 0.5604          
##              Prevalence : 0.5990          
##          Detection Rate : 0.3980          
##    Detection Prevalence : 0.5428          
##       Balanced Accuracy : 0.6518          
##                                           
##        'Positive' Class : FALSE           
## 
# Summary plot
par(mfrow = c(2,2))
plot(home_glm)
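
The two fits can also be compared directly from the numbers printed in the summaries. The sketch below assumes both models were fitted to the same 3,040 games (the identical null deviance suggests this) and that the reduced model is nested in the full one. It runs a likelihood-ratio test on the reported deviances and converts one reported coefficient into an odds ratio.

```r
# Values copied from the two summary() outputs above.
dev_full    <- 3695.4; df_full    <- 2967   # 72 predictors + intercept
dev_reduced <- 3772.6; df_reduced <- 3035   # 4 predictors + intercept

# Likelihood-ratio test: does the full model fit significantly better?
lr_stat <- dev_reduced - dev_full
lr_df   <- df_reduced - df_full
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# For a 0/1 response, AIC = residual deviance + 2 * (number of coefficients).
aic_full    <- dev_full + 2 * 73
aic_reduced <- dev_reduced + 2 * 5

# Odds ratio for one extra point of mean home rating (reduced model).
odds_ratio_home_rating <- exp(0.113161)

print(c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value))
print(c(aic_full = aic_full, aic_reduced = aic_reduced))
print(odds_ratio_home_rating)
```

The 68 extra predictors do not improve the fit significantly at the 5% level, which agrees with the lower AIC of the reduced model. The odds ratio says that each extra point of mean home rating multiplies the odds of a home win by roughly 1.12.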

Model 2

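Multinomial logistic regression using the nnet package. In the confusion matrix below, the rows are the actual outcomes and the columns are the model's predictions: the model predicts a draw in only 2 of the 3,040 games.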

## # weights:  18 (10 variable)
## initial  value 3339.781358 
## iter  10 value 2987.333352
## final  value 2986.500537 
## converged
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    D    H
##          A  438    0  429
##          D  237    2  544
##          H  255    0 1135
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5181         
##                  95% CI : (0.5002, 0.536)
##     No Information Rate : 0.6934         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1908         
##  Mcnemar's Test P-Value : <2e-16         
## 
## Statistics by Class:
## 
##                      Class: A  Class: D Class: H
## Sensitivity            0.4710 1.0000000   0.5384
## Specificity            0.7967 0.7429230   0.7264
## Pos Pred Value         0.5052 0.0025543   0.8165
## Neg Pred Value         0.7736 1.0000000   0.4103
## Prevalence             0.3059 0.0006579   0.6934
## Detection Rate         0.1441 0.0006579   0.3734
## Detection Prevalence   0.2852 0.2575658   0.4572
## Balanced Accuracy      0.6338 0.8714615   0.6324
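
The fitting code for this model is not shown, so here is a minimal sketch of a three-class fit with `nnet::multinom`. The data are simulated and the choice of four predictors is an assumption (the `18 (10 variable)` weights line is what one would get with four predictors and three classes). Only the column names come from the tables above.

```r
library(nnet)

set.seed(1)
n <- 300
toy <- data.frame(
  cumulative_margin_home   = rnorm(n, 0, 10),
  cumulative_margin_away   = rnorm(n, 0, 10),
  overall_rating_mean_home = rnorm(n, 75, 4),
  overall_rating_mean_away = rnorm(n, 75, 4)
)

# Simulated outcome: a better-rated home side wins more often (toy rule).
latent <- 0.1 * (toy$overall_rating_mean_home - toy$overall_rating_mean_away) +
  0.3 + rnorm(n)
toy$outcome <- cut(latent, breaks = c(-Inf, -0.3, 0.3, Inf),
                   labels = c("A", "D", "H"))

fit <- multinom(outcome ~ ., data = toy, trace = FALSE)

pred_class <- predict(fit, toy)                  # most likely outcome per game
pred_probs <- predict(fit, toy, type = "probs")  # one probability per outcome

# Rows are predictions, columns are actual outcomes.
conf <- table(predicted = pred_class, actual = toy$outcome)
print(conf)
```

The class probabilities from `type = "probs"` are often more useful than the hard class label, since draws are rarely the single most likely outcome.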

Model 3

Predicting the scores (Poisson)

for both the home team and the away team

## 
## Call:
## glm(formula = home_team_goal ~ ., family = poisson(), data = full %>% 
##     select(-away_team_goal, -outcome, -home_win))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5051  -0.8850  -0.1596   0.5343   3.7227  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -1.051697   0.610270  -1.723 0.084829 .  
## cumulative_margin_home    0.004682   0.001302   3.595 0.000325 ***
## cumulative_margin_away   -0.006371   0.001398  -4.558 5.17e-06 ***
## overall_rating_mean_home  0.045638   0.005495   8.306  < 2e-16 ***
## overall_rating_mean_away -0.026601   0.005475  -4.859 1.18e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3789.7  on 3039  degrees of freedom
## Residual deviance: 3396.9  on 3035  degrees of freedom
## AIC: 9275.9
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = away_team_goal ~ ., family = poisson(), data = full %>% 
##     select(-home_team_goal, -outcome, -home_win))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4469  -1.2986  -0.1242   0.5701   3.1087  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -1.019710   0.703846  -1.449  0.14740    
## cumulative_margin_home   -0.004480   0.001622  -2.763  0.00573 ** 
## cumulative_margin_away    0.001785   0.001518   1.176  0.23959    
## overall_rating_mean_home -0.039281   0.006433  -6.106 1.02e-09 ***
## overall_rating_mean_away  0.054050   0.006260   8.635  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3886.2  on 3039  degrees of freedom
## Residual deviance: 3571.6  on 3035  degrees of freedom
## AIC: 8367.6
## 
## Number of Fisher Scoring iterations: 5
## Confusion Matrix and Statistics
## 
##        
##         FALSE TRUE
##   FALSE   684  255
##   TRUE    966 1135
##                                           
##                Accuracy : 0.5984          
##                  95% CI : (0.5807, 0.6158)
##     No Information Rate : 0.5428          
##     P-Value [Acc > NIR] : 3.635e-10       
##                                           
##                   Kappa : 0.2221          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4145          
##             Specificity : 0.8165          
##          Pos Pred Value : 0.7284          
##          Neg Pred Value : 0.5402          
##              Prevalence : 0.5428          
##          Detection Rate : 0.2250          
##    Detection Prevalence : 0.3089          
##       Balanced Accuracy : 0.6155          
##                                           
##        'Positive' Class : FALSE           
## 
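
One way to turn two predicted goal rates into match-outcome probabilities is sketched below. It assumes the home and away goal counts are independent Poisson variables, which is a simplification and not necessarily how the table above was produced. `outcome_probs` is a hypothetical helper, and the rates 1.5 and 1.1 are illustrative. In practice they would come from `predict(..., type = "response")` on the two fitted models.

```r
# Hypothetical helper: P(home win), P(draw), P(away win) from two goal rates,
# assuming independent Poisson goal counts truncated at max_goals.
outcome_probs <- function(lambda_home, lambda_away, max_goals = 10) {
  goals <- 0:max_goals
  # joint[i, j] = P(home scores i - 1 goals and away scores j - 1 goals)
  joint <- outer(dpois(goals, lambda_home), dpois(goals, lambda_away))
  c(home = sum(joint[lower.tri(joint)]),  # home goals > away goals
    draw = sum(diag(joint)),              # equal scores
    away = sum(joint[upper.tri(joint)]))  # away goals > home goals
}

probs <- outcome_probs(lambda_home = 1.5, lambda_away = 1.1)
print(round(probs, 3))
```

Unlike comparing the two expected scores directly, this gives an explicit draw probability, so the Poisson models can be evaluated on all three outcomes.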

Our model as a betting agent | Further evaluation

We decided to see how our multinomial model would perform as a betting agent, i.e. placing a bet on every game:
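
A minimal sketch of how such an evaluation could be scored. It assumes a flat stake of one unit on the predicted outcome of every game and decimal bookmaker odds. The odds, predictions and results below are made-up toy values, not data from this project.

```r
# Toy games: predicted and actual outcomes plus decimal odds (hypothetical).
bets <- data.frame(
  predicted = c("H", "A", "H", "D"),
  actual    = c("H", "H", "H", "A"),
  odds_H    = c(1.8, 2.5, 1.5, 2.9),
  odds_D    = c(3.5, 3.3, 4.0, 3.2),
  odds_A    = c(4.5, 2.9, 6.0, 2.4),
  stringsAsFactors = FALSE
)

odds <- as.matrix(bets[, c("odds_H", "odds_D", "odds_A")])
colnames(odds) <- c("H", "D", "A")

# Odds of the outcome we backed in each game.
backed_odds <- odds[cbind(seq_len(nrow(bets)),
                          match(bets$predicted, colnames(odds)))]

# A winning one-unit bet returns (odds - 1); a losing bet costs the stake.
profit <- ifelse(bets$predicted == bets$actual, backed_odds - 1, -1)

total_profit <- sum(profit)
roi <- total_profit / nrow(bets)   # profit per unit staked
print(c(total_profit = total_profit, roi = roi))
```

Return on investment is a stricter test than accuracy: a model that mostly backs short-priced favourites can be right often and still lose money.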

Future

Our data ended up being very high dimensional. We could explore methods of reducing the dimensionality of our dataset, such as PCA.
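
A rough sketch of what the PCA step might look like with base R's `prcomp`. The attribute columns are simulated so that three of them are strongly correlated, and the 90% variance cut-off is an arbitrary choice for illustration.

```r
set.seed(42)
n <- 200
skill <- rnorm(n)  # shared latent factor that makes the attributes correlated

# Simulated team-level attributes (hypothetical values, real attribute names).
attrs <- data.frame(
  short_passing = 65 + 5 * skill + rnorm(n),
  ball_control  = 68 + 5 * skill + rnorm(n),
  dribbling     = 62 + 4 * skill + rnorm(n),
  strength      = 70 + rnorm(n, sd = 4)
)

# Centre and scale so that no attribute dominates because of its units.
pca <- prcomp(attrs, center = TRUE, scale. = TRUE)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep the smallest number of components explaining at least 90% of variance.
n_keep <- which(cumsum(var_explained) >= 0.9)[1]
reduced <- pca$x[, seq_len(n_keep), drop = FALSE]

print(round(cumsum(var_explained), 3))
print(dim(reduced))
```

The component scores in `reduced` could then replace the raw attribute means as predictors in the GLM.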

We should have looked for more varied data from different sources, such as stats from the actual games themselves.

Packages used

Data manipulation:

Graphics:

Confusion matrix: